Frustratingly Short Attention Spans in Neural Language Modeling
Authors
Abstract
Neural language models predict the next token using a latent representation of the immediate token history. Recently, various methods for augmenting neural language models with an attention mechanism over a differentiable memory have been proposed. For predicting the next token, these models query information from a memory of the recent history, which can facilitate learning mid- and long-range dependencies. However, conventional attention mechanisms used in memory-augmented neural language models produce a single output vector per time step. This vector is used both for predicting the next token and as the key and value of a differentiable memory of the token history. In this paper, we propose a neural language model with a key-value attention mechanism that outputs separate representations for the key and value of a differentiable memory, as well as for encoding the next-word distribution. This model outperforms existing memory-augmented neural language models on two corpora. Yet, we found that our method mainly utilizes a memory of the five most recent output representations. This led to the unexpected main finding that a much simpler model based only on the concatenation of recent output representations from previous time steps is on par with more sophisticated memory-augmented neural language models.
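To make the key-value idea concrete, the following is a minimal sketch (our illustration, not the authors' code) of one attention step in which each output vector is split into a key that queries the memory, a value that builds the context, and a predict part fed to the next-word softmax. The class name, layer sizes, and scoring function are illustrative assumptions; the five-step window echoes the abstract's finding about recent history.

# Hedged sketch of key-value-predict attention over the last few time steps.
# Names, sizes, and the additive scoring function are assumptions, not the
# paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyValuePredictAttention(nn.Module):
    def __init__(self, hidden_size, window=5):
        super().__init__()
        self.window = window          # attend only over the last few steps
        self.split = hidden_size      # size of each key/value/predict part
        self.score = nn.Linear(hidden_size, 1, bias=False)
        self.combine = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, outputs):
        # outputs: (batch, time, 3 * hidden) from an RNN whose output vector
        # is split into key / value / predict parts.
        keys, values, predicts = torch.split(outputs, self.split, dim=-1)
        batch, time, _ = keys.shape
        states = []
        for t in range(time):
            lo = max(0, t - self.window)
            if t == 0:
                h = predicts[:, t]            # no history yet
            else:
                k, v = keys[:, lo:t], values[:, lo:t]
                # score the current key against the keys of recent steps
                scores = self.score(torch.tanh(k + keys[:, t:t + 1])).squeeze(-1)
                attn = F.softmax(scores, dim=-1)
                context = torch.einsum("bt,btd->bd", attn, v)
                # merge the attended context with the predict part
                h = torch.tanh(self.combine(torch.cat([context, predicts[:, t]], -1)))
            states.append(h)
        return torch.stack(states, dim=1)     # fed to the output softmax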
Similar resources
Distraction-Based Neural Networks for Modeling Documents
Distributed representations learned with neural networks have recently been shown to be effective in modeling natural languages at fine granularities such as words, phrases, and even sentences. Whether and how such an approach can be extended to help model larger spans of text, e.g., documents, is intriguing, and further investigation would still be desirable. This paper aims to enhance neural network...
Attention-based Memory Selection Recurrent Network for Language Modeling
Recurrent neural networks (RNNs) have achieved great success in language modeling. However, since RNNs have a fixed-size memory, they cannot store all the information about the words seen so far in the sentence, and thus useful long-term information may be ignored when predicting the next words. In this paper, we propose Attention-based Memory Selection Recurrent Network...
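As a rough, hedged sketch of what such memory selection could look like (not the cited authors' implementation; all names and sizes below are assumptions), an RNN can keep the hidden states of the words seen so far and attend over that memory at every step when predicting the next word:

# Hedged sketch: language model that re-selects information from a memory of
# its own past hidden states via attention. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemorySelectionLM(nn.Module):
    def __init__(self, vocab_size, emb_size=128, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.GRU(emb_size, hidden_size, batch_first=True)
        self.query = nn.Linear(hidden_size, hidden_size, bias=False)
        self.out = nn.Linear(2 * hidden_size, vocab_size)

    def forward(self, tokens):
        states, _ = self.rnn(self.embed(tokens))        # (batch, time, hidden)
        logits = []
        for t in range(states.size(1)):
            h = states[:, t]                            # current state
            memory = states[:, : t + 1]                 # states seen so far
            scores = torch.einsum("bd,btd->bt", self.query(h), memory)
            attn = F.softmax(scores, dim=-1)            # select from memory
            context = torch.einsum("bt,btd->bd", attn, memory)
            logits.append(self.out(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)               # next-word logits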
Distraction-Based Neural Networks for Document Summarization
Distributed representations learned with neural networks have recently been shown to be effective in modeling natural languages at fine granularities such as words, phrases, and even sentences. Whether and how such an approach can be extended to help model larger spans of text, e.g., documents, is intriguing, and further investigation would still be desirable. This paper aims to enhance neural network...
Short Term Load Forecasting by Using ESN Neural Network Hamedan Province Case Study
Forecasting electrical energy demand and consumption is one of the important decision-making tools for distribution companies in contract scheduling and purchasing of electrical energy. This paper studies load-consumption modeling in the Hamedan city and province distribution network by applying an ESN neural network. Weather forecasting data such as minimum daily temperature, average daily temp...
Autoregressive Attention for Parallel Sequence Modeling
We introduce an autoregressive attention mechanism for parallelizable character-level sequence modeling. We use this method to augment a neural model consisting of blocks of causal convolutional layers connected by highway network skip connections. We denote the models with and without the proposed attention mechanism respectively as Highway Causal Convolution (Causal Conv) and Autoregressive-at...
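The following is a minimal, assumed sketch (not the cited paper's code) of one such building block: a causal 1-D convolution whose output is merged with its input through a highway-style gate, so the model stays parallelizable over time while position t never sees future positions:

# Hedged sketch of a causal convolution block with a highway skip connection.
# Kernel size and gating choice are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvHighwayBlock(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1                      # pad only on the left
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        self.gate = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):
        # x: (batch, channels, time); left padding keeps the convolution
        # causal, so position t never sees positions after t.
        padded = F.pad(x, (self.pad, 0))
        h = torch.tanh(self.conv(padded))               # candidate activation
        g = torch.sigmoid(self.gate(padded))            # highway transform gate
        return g * h + (1.0 - g) * x                    # highway skip connection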
Journal: CoRR
Volume: abs/1702.04521
Year of publication: 2017